Computer Vision Culture
Computer vision is actually a realm of disciplines. For the sake of simplicity, this document divides it into two main categories: image and video processing, and image and video analysis. A third category is computer graphics, which deals with the display and rendering of images and videos. Like speech synthesis, it is a separate field and does not regularly use machine learning techniques.
Image and Video Processing
Image processing is a rather traditional field. Many of its technologies are not considered machine learning techniques but simply mathematical operations; much of it is derived from signal processing. The most important tools are the Fast Fourier Transform, convolution with kernels, and morphological operations. Using these, an image can be blurred or denoised, edges can be detected, and so on. Other important operations include resizing and color correction. Image and video compression have dominated the field in the past decades. An overview is provided, for example, by: Al Bovik: Handbook of Image and Video Processing, Second Edition, Elsevier Academic Press, Burlington, MA, USA, 2005. ISBN: 0-12-119792-1.
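For illustration, here is a minimal Python sketch (using NumPy and SciPy) of the tools named above: kernel convolution for blurring and edge detection, the 2-D FFT, and a morphological operation. The kernels and values are standard textbook examples, not taken from the cited handbook.

```python
import numpy as np
from scipy import ndimage

img = np.random.rand(64, 64)          # stand-in for a grayscale image

# Blur: convolve with a normalized 3x3 box kernel
box = np.ones((3, 3)) / 9.0
blurred = ndimage.convolve(img, box, mode='nearest')

# Edge detection: convolve with Sobel kernels and combine the gradients
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
gx = ndimage.convolve(img, sobel_x, mode='nearest')
gy = ndimage.convolve(img, sobel_x.T, mode='nearest')
edges = np.hypot(gx, gy)

# Frequency domain: 2-D FFT, e.g. as the first step of a frequency-space filter
spectrum = np.fft.fft2(img)

# Morphological operation: erosion of a thresholded (binary) image
binary = img > 0.5
eroded = ndimage.binary_erosion(binary)
```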
Image and Video Analysis
Image and video analysis deals with the content of images and videos. Its main subfields are image and video retrieval (finding all images that contain object x), image and video segmentation (finding the exact boundaries of image objects or video scenes), object recognition (detecting particular objects, e.g., is there a face or a person in the image), and tracking (where is a particular object located). Images and videos require a relatively high sampling resolution, measured in dots per inch or pixels, so it is very rare that they are stored uncompressed. Unlike speech algorithms, computer vision algorithms therefore have to be robust to various compression artifacts, even though they mostly operate on the decompressed data. As in speech processing (see above), this has several consequences:
a) Image and video processing, especially if it is to be performed online and in real time, cannot rely on elaborate machine learning techniques. One hopes to find features that can be thresholded easily (see the sketch after this list).
b) Scientific progress is considered fast-paced. A typical publication is 8-10 pages double column (in speech, 4-6).
c) Image and video processing is only just developing a benchmarking culture.
d) The majority of approaches seek to work online, i.e., in real time and incrementally as new data comes in, because there is broad consumer demand for image and video processing methods that are applied in editors.
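As a toy illustration of the "easily thresholded feature" idea from (a), the following Python snippet flags motion between two video frames by thresholding their mean absolute difference. The threshold value is an arbitrary assumption.

```python
import numpy as np

def motion_detected(frame_a, frame_b, threshold=10.0):
    """Cheap online test: average per-pixel change between consecutive frames."""
    diff = np.mean(np.abs(frame_a.astype(float) - frame_b.astype(float)))
    return diff > threshold
```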
Similar to speech processing, image and video analysis usually relies on probabilistic methods. Machine learning techniques used for various tasks include Gaussian Mixture Models (GMMs), Neural Networks, also called Multi-Layer Perceptrons (NNs or MLPs), Support Vector Machines (SVMs), and Hidden Markov Models (HMMs). However, for many problems, non-probabilistic methods have also been shown to work; here, distance metrics play a major role. Feature extraction is an important research part of every paper. Other than SIFT for image retrieval, there is actually no standardized or commonly used set of features, although 8x8-block DCT coefficients and optical flow (the set of all motion vectors) seem to be predominant. Usually, the color space of an image or video is discussed, with the standard spaces being RGB, YUV, HSI (or HSV), and more recently LAB. Edge detection (also called shape extraction) and color histograms are both rather simple and effective for various tasks and are therefore commonly used.
In image retrieval, common datasets are often used in order to make results comparable. Known datasets include the Corel Stock Photo Library and LabelMe by MIT CSAIL. Accuracy is usually measured in precision, recall, and F-measure (a synonym for F-score). NIST provides a set of tasks and a dataset that is evaluated regularly under TRECVID. The CLEAR evaluation was also initiated by NIST. Beyond those, many benchmarks and datasets exist that were created by individual institutions or researchers (e.g., the Berkeley Image Segmentation Dataset and Benchmark).
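To make two of these notions concrete, below is a small Python sketch: a normalized color histogram as a global image feature, and the precision/recall/F-measure computation used to score retrieval results. The image and the result sets are invented for illustration.

```python
import numpy as np

# --- Color histogram feature (here over RGB; HSV or LAB is often preferred) ---
img = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
hist = [np.histogram(img[..., c], bins=16, range=(0, 256))[0] for c in range(3)]
feature = np.concatenate(hist) / img[..., 0].size   # normalized 48-bin descriptor

# --- Retrieval metrics ---
relevant  = {"img03", "img07", "img11", "img19"}    # ground-truth matches
retrieved = {"img03", "img07", "img42"}             # what the system returned
tp = len(relevant & retrieved)
precision = tp / len(retrieved)                     # 2/3
recall    = tp / len(relevant)                      # 2/4
f_measure = 2 * precision * recall / (precision + recall)   # ~0.57
```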
An Example: Image Segmentation
Object extraction from images and videos (interactive or non-interactive image and video segmentation) is an important field in computer vision. Here is an example of how a semi-automatic object extractor in GIMP works (a code sketch of the pipeline follows the list).
- User interaction: Given an image, a free-hand selection tool is used to specify the region of interest. It must contain all foreground objects to extract and as little background as possible. The pixels outside the region of interest form the sure background, while the inner region defines a superset of the foreground, i.e., the unknown region. A so-called foreground brush is then used to mark representative foreground regions. The algorithm outputs a selection mask. The selection can be refined either by adding further foreground markings or by adding background markings using the background brush.
- Feature extraction: The algorithm then converts all pixels into CIELAB space.
- Model building: Sets of representative colors for sure foreground and sure background, the so-called color signatures, are created by a clustering technique.
- Classification: All image pixels are then assigned to foreground or background by a weighted nearest neighbor search in the color signatures.
- Postprocessing: Standard image processing operations like erode, dilate, and blur are applied to remove artifacts, and the largest connected foreground component is found.
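The following is a minimal Python sketch of this pipeline, assuming NumPy, SciPy, and scikit-image are available. The function names, the number of clusters, and the unweighted nearest-neighbor rule are simplifying assumptions; GIMP's actual extractor uses a weighted nearest-neighbor variant, as described above.

```python
import numpy as np
from scipy import ndimage
from scipy.cluster.vq import kmeans2
from skimage import color

def build_signature(lab_pixels, n_clusters=8):
    """Model building: cluster known pixels into a few representative colors."""
    # Assumes more marked pixels than clusters are available.
    centroids, _ = kmeans2(lab_pixels.astype(np.float64), n_clusters, minit='++')
    return centroids

def segment(image_rgb, fg_mask, bg_mask):
    """fg_mask/bg_mask: boolean arrays marking sure foreground/background pixels."""
    lab = color.rgb2lab(image_rgb)             # feature extraction: CIELAB
    fg_sig = build_signature(lab[fg_mask])     # color signatures
    bg_sig = build_signature(lab[bg_mask])
    pixels = lab.reshape(-1, 3)
    # Classification: each pixel goes to the side with the closest signature
    # color (unweighted here; memory use is fine for this sketch's small k).
    d_fg = np.min(np.linalg.norm(pixels[:, None] - fg_sig, axis=2), axis=1)
    d_bg = np.min(np.linalg.norm(pixels[:, None] - bg_sig, axis=2), axis=1)
    mask = (d_fg < d_bg).reshape(image_rgb.shape[:2])
    # Postprocessing: morphological cleanup, keep largest connected component
    mask = ndimage.binary_dilation(ndimage.binary_erosion(mask))
    labels, n = ndimage.label(mask)
    if n > 0:
        sizes = ndimage.sum(mask, labels, range(1, n + 1))
        mask = labels == (np.argmax(sizes) + 1)
    return mask
```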